pwd
cd ~/Desktop/
wget https://github.com/AntonioJBT/teaching_ICL/archive/master.zip
unzip master.zip
cd teaching_ICL-master
mkdir my_results
cd my_results
We’ll first need to specify in R where we want to run the analysis from. If you followed the instructions above you can set your working directory to the same location. This is much easier to do manually from the R console. Copy and paste the following:
setwd('~/Desktop/teaching_ICL-master/my_results/')
getwd()
Normally we would prefer to keep code, data and results in separate directories. Here we’re using R Notebook and the knitr package to run commands, create output and produce a nice looking report file which contains code, text and results together. This R notebook let us have it all in one place, which can sometimes be very useful, particularly for exploratory analysis.
We’ll setup a separate path to a ‘results’ directory for plots that we might want to save independently though. We also already have a separate ‘data’ folder where we keep our data sets.
In a standard R script we would simply use commands such as setwd(), source(), etc. An important caveat here is that, by default, knitr uses the same directory as your .Rmd file to find any files that are necessary to run the code (e.g. any data files) and, if any output is generated, it???ll be saved in that directory as well. This is set independently for each code chunk.
To make it easier we can add an output path to make sure we save to the directory we want:
path_results <- '../my_results/'
You may need to install them first, e.g.:
source("https://bioconductor.org/biocLite.R")
biocLite()
biocLite('package')
library(knitr)
library(plyr) # check the tidyverse packages if you're not familiar with them yet
library(MatrixEQTL) # run regression analysis efficiently for millions to billions of tests
# library(data.table) # reads, writes, processes data frames very quickly and can handle large sets
In the previous exercise we analysed candidate markers using regression association tests. Here, we’ll perform the same type of analysis for thousands of genetic variants and molecular phenotypes.
The data we’ll be using consists of (badly) simulated sets. We have only simulated a few characteristics of the data but it is not an attempt to characterise genetic or molecular markers in a serious way. The data contains single nucleotide polymorphisms and molecular data (such as gene expression or metabolite data points).
We can assume that the data has already been quality controlled and all the pre-processing steps have been carried out.
What are the names of the input files?
Name them once here and if they change you only need to change the code once. (e.g. if you get new data, start a different project using the same analysis code, etc.).
input_gex <- '../data/cont_var_sim_data.tsv'
input_SNP <- '../data/plink.A-transpose.matrixQTL.geno'
input_covariates <- '../data/cov_sim_data.tsv'
When say covariates in this context, what are we actually referring to?
Here we’ll assume the data is from metabolites so we won’t include locations.
#input_snpsPos <- '../data/snpsloc.txt'
#input_genePos <- '../data/geneloc.txt'
We’ll be saving some results, we can define the names of the files here:
output_file_name <- sprintf('%s/QTL_analysis_larger.mxqtl.tsv', path_results)
output_file_name
[1] "../my_results//QTL_analysis_larger.mxqtl.tsv"
output_file_name.cis <- sprintf('%s/QTL_analysis_larger.mxqtl.cis.tsv', path_results)
output_file_name.cis
[1] "../my_results//QTL_analysis_larger.mxqtl.cis.tsv"
Let’s look at the molecular data:
gex_data <- read.csv(input_gex, header = TRUE, stringsAsFactors = FALSE, sep = '\t')
gex_data[1:5, 1:5]
dim(gex_data)
[1] 2000 10001
head(gex_data)
tail(gex_data)
str(gex_data, list.len = 10)
'data.frame': 2000 obs. of 10001 variables:
$ X : chr "var0" "var1" "var2" "var3" ...
$ per0 : num 2.13 2.02 2.13 2 2.14 ...
$ per1 : num 2.02 2.1 2.01 2 2.11 ...
$ per2 : num 2.09 2.02 2.09 2.03 2.07 ...
$ per3 : num 2.06 2.16 2.06 2.03 2.07 ...
$ per4 : num 2.06 2.03 2.02 2.03 2.1 ...
$ per5 : num 2.01 2.11 2.14 2.1 2.07 ...
$ per6 : num 2.24 2.06 2.14 2.02 2.01 ...
$ per7 : num 2.07 2.08 2.04 2.03 2.13 ...
$ per8 : num 2.14 2.02 2.05 2.09 2.03 ...
[list output truncated]
class(gex_data)
[1] "data.frame"
colnames(gex_data)[1:10]
[1] "X" "per0" "per1" "per2" "per3" "per4" "per5" "per6" "per7" "per8"
rownames(gex_data)[1:10]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
Let’s plot a few of the variables just to check that our dataset is as we expect:
boxplot(gex_data[, 2:6])
boxplot(gex_data[, (ncol(gex_data)-5):ncol(gex_data)])
What are we actually plotting here? What are the columns and what are the rows?
What if we want to see how one (or more) of the measured molecules looks across all individuals?
This is a very quick look at the data, just for sanity, as we should have already made some pretty plots during the QC stage.
boxplot(t(gex_data[, -1]))
Let’s look at the genetic data:
SNP_data <- read.csv(input_SNP, header = TRUE, stringsAsFactors = FALSE, sep = '\t')
SNP_data[1:5, 1:5]
head(SNP_data)
tail(SNP_data)
str(SNP_data, list.len = 10)
'data.frame': 1000 obs. of 10001 variables:
$ SNP : chr "snp0" "snp1" "snp2" "snp3" ...
$ per0_per0 : int 2 1 2 1 2 0 1 1 1 1 ...
$ per1_per1 : int 0 1 1 1 0 0 2 0 2 0 ...
$ per2_per2 : int 1 1 1 1 1 1 1 1 1 0 ...
$ per3_per3 : int 0 1 2 0 1 2 2 0 2 1 ...
$ per4_per4 : int 1 0 2 1 1 1 1 0 1 1 ...
$ per5_per5 : int 1 0 2 1 0 1 1 1 1 2 ...
$ per6_per6 : int 1 2 0 2 1 2 2 1 1 2 ...
$ per7_per7 : int 0 2 1 2 0 1 1 0 0 2 ...
$ per8_per8 : int 2 2 2 1 1 2 0 2 1 1 ...
[list output truncated]
dim(SNP_data)
[1] 1000 10001
class(SNP_data)
[1] "data.frame"
colnames(SNP_data)[1:10]
[1] "SNP" "per0_per0" "per1_per1" "per2_per2" "per3_per3" "per4_per4" "per5_per5"
[8] "per6_per6" "per7_per7" "per8_per8"
rownames(SNP_data)[1:10]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
Let’s plot a few of the variables just to check quickly that our dataset is as we expect:
boxplot(SNP_data[, 2:6])
boxplot(SNP_data[, (ncol(SNP_data)-5):ncol(SNP_data)])
What is this about?
SNP data in this file looks more like count data (number of minor alleles at that particular genomic location).
Simple counts or a table might be better here:
as.data.frame(apply(SNP_data, 1, function(x) count(x))[1]) # rows
Let’s look at the covariates data:
cov_data <- read.csv(input_covariates, header = TRUE, stringsAsFactors = FALSE, sep = '\t')
cov_data[1:5, 1:5]
dim(cov_data)
[1] 10 10001
head(cov_data)
tail(cov_data)
str(cov_data, list.len = 10)
'data.frame': 10 obs. of 10001 variables:
$ X : chr "var0" "var1" "var2" "var3" ...
$ per0 : num 2.72 15.03 10.98 7.38 9.84 ...
$ per1 : num 7.75 5.78 10.03 12.54 1.46 ...
$ per2 : num 4.64 -1.87 10.24 8.58 2.81 ...
$ per3 : num 6.93 18.24 10.1 8.73 13.48 ...
$ per4 : num 7.34 10.97 13.17 2.01 12.99 ...
$ per5 : num 9.87 18.72 15.21 14.33 -1.42 ...
$ per6 : num 15.112 12.682 13.229 18.171 0.999 ...
$ per7 : num 2.57 6.4 4.12 1.19 6.36 ...
$ per8 : num 6.17 19.91 6.4 13.81 21.99 ...
[list output truncated]
class(cov_data)
[1] "data.frame"
colnames(cov_data)[1:10]
[1] "X" "per0" "per1" "per2" "per3" "per4" "per5" "per6" "per7" "per8"
rownames(cov_data)[1:10]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
What could these variables represent?
We can assume we ran principle components analysis on our data beforehand and we’ll use these as possible confounders.
summary(cov_data[, 1:10])
X per0 per1 per2 per3
Length:10 Min. : 1.633 Min. : 1.461 Min. :-1.873 Min. : 6.749
Class :character 1st Qu.: 7.253 1st Qu.: 6.615 1st Qu.: 5.546 1st Qu.: 6.936
Mode :character Median : 8.711 Median :10.358 Median : 8.698 Median : 9.411
Mean : 8.504 Mean : 9.513 Mean : 7.667 Mean :10.533
3rd Qu.:10.695 3rd Qu.:12.397 3rd Qu.:11.078 3rd Qu.:13.542
Max. :15.027 Max. :15.039 Max. :12.137 Max. :18.242
per4 per5 per6 per7 per8
Min. : 1.532 Min. :-1.418 Min. : 0.9987 Min. : 1.187 Min. : 6.173
1st Qu.: 3.437 1st Qu.: 8.530 1st Qu.: 9.4968 1st Qu.: 4.678 1st Qu.: 7.314
Median : 8.167 Median :11.189 Median :12.9551 Median : 8.515 Median :11.592
Mean : 8.207 Mean :10.131 Mean :12.2551 Mean : 8.791 Mean :12.552
3rd Qu.:12.483 3rd Qu.:13.929 3rd Qu.:17.4062 3rd Qu.:13.007 3rd Qu.:17.288
Max. :15.598 Max. :18.722 Max. :20.1949 Max. :15.691 Max. :21.995
Let’s double check that the order between files matches, often this is essential as silent errors can happen if they don’t.
identical(colnames(gex_data), colnames(SNP_data))
[1] FALSE
identical(colnames(gex_data), colnames(cov_data))
[1] TRUE
identical(rownames(gex_data), rownames(cov_data))
[1] FALSE
Why are all these false? Is it a problem with the actual order or with the variable and row names?
MatrixEQTL is an R package (and its paper) that can handle billions of comparisons efficiently. It runs regression tests on SNP data and a continuous variable such as gene expression or metabolite abundance levels. See the tutorial online and the original paper to understand more about the use of linear algebra and what each of the parameters means. Some of the data that we are using are from their examples.
We’ll set up the parameters that we need for MatrixEQTL here. I’ve commented out some of the variables we don’t need to set in this particular analysis:
useModel <- modelLINEAR # modelANOVA or modelLINEAR or modelLINEAR_CROSS
# The p-value threshold determines which gene-SNP associations are saved in the output file.
# Note that for larger datasets the threshold should be lower. Setting the threshold to a high value for a large dataset may cause excessively large output files.
pvOutputThreshold <- 1e-5
# pvOutputThreshold.cis <- 1e-3
# Determine the distance between genes and SNPs to define cis and trans:
# cisDist <- 1e3
# Define the covariance matrix for the error term.
# Consider an error covariance matrix if necessary (correlated variables or errors)
# This parameter is rarely used. If the covariance matrix is a multiple of identity, set it to numeric().
errorCovariance <- numeric()
In the previous sections we loaded the data to inspect it. We’ll do so again here using MatrixEQTL. We could have saved that earlier step though if we had already explored our data and were now running the QTL directly. We could also have used the functions that MatrixEQTL provides instead and explored it here.
snps = SlicedData$new();
snps$fileDelimiter = "\t"; # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1; # one row of column labels
snps$fileSkipColumns = 1; # one column of row labels
snps$fileSliceSize = 2000; # read file in pieces of 2,000 rows
snps$LoadFile(input_SNP)
Rows read: 1000 done.
snps
SlicedData object. For more information type: ?SlicedData
Number of columns: 10000
Number of rows: 1000
Data is stored in 1 slices
Top left corner of the first slice (up to 10x10):
per0_per0 per1_per1 per2_per2 per3_per3 per4_per4 per5_per5 per6_per6 per7_per7
snp0 2 0 1 0 1 1 1 0
snp1 1 1 1 1 0 0 2 2
snp2 2 1 1 2 2 2 0 1
snp3 1 1 1 0 1 1 2 2
snp4 2 0 1 1 1 0 1 0
snp5 0 0 1 2 1 1 2 1
snp6 1 2 1 2 1 1 2 1
snp7 1 0 1 0 0 1 1 0
snp8 1 2 1 2 1 1 1 0
snp9 1 0 0 1 1 2 2 2
per8_per8 per9_per9
snp0 2 0
snp1 2 1
snp2 2 1
snp3 1 2
snp4 1 0
snp5 2 0
snp6 0 1
snp7 2 2
snp8 1 1
snp9 1 1
gene = SlicedData$new();
gene$fileDelimiter = "\t"; # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1; # one row of column labels
gene$fileSkipColumns = 1; # one column of row labels
gene$fileSliceSize = 2000; # read file in pieces of 2,000 rows
gene$LoadFile(input_gex);
Rows read: 2,000
Rows read: 2000 done.
gene
SlicedData object. For more information type: ?SlicedData
Number of columns: 10000
Number of rows: 2000
Data is stored in 1 slices
Top left corner of the first slice (up to 10x10):
per0 per1 per2 per3 per4 per5 per6 per7 per8
var0 2.132294 2.018942 2.088461 2.063568 2.057563 2.008245 2.240915 2.065163 2.138533
var1 2.017955 2.098512 2.022354 2.160059 2.033293 2.106890 2.058435 2.077967 2.021689
var2 2.125305 2.010552 2.092786 2.061109 2.024689 2.144936 2.135578 2.036396 2.052742
var3 2.003505 2.004717 2.030975 2.030150 2.029121 2.103465 2.024834 2.033229 2.085112
var4 2.137023 2.112701 2.065903 2.071997 2.095170 2.071265 2.008435 2.131564 2.027557
var5 2.068027 2.073748 2.221624 2.007325 2.042881 2.218519 2.069544 2.101766 2.009634
var6 2.148211 2.012304 2.042925 2.160128 2.015015 2.248971 2.009899 2.084554 2.125608
var7 2.173883 2.199776 2.124398 2.026164 2.036846 2.126623 2.091552 2.108369 2.012217
var8 2.106123 2.083136 2.205930 2.113247 2.020471 2.055118 2.071054 2.068571 2.041257
var9 2.046247 2.168801 2.041953 2.071652 2.074716 2.173921 2.067337 2.109000 2.122996
per9
var0 2.009990
var1 2.069470
var2 2.020616
var3 2.046913
var4 2.030127
var5 2.009193
var6 2.169123
var7 2.132810
var8 2.087303
var9 2.031786
cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t"; # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values; This is from the plink encoding.
cvrt$fileSkipRows = 1; # one row of column labels
cvrt$fileSkipColumns = 1; # one column of row labels
cvrt$fileSliceSize = 2000; # read file in pieces of 2,000 rows
cvrt$LoadFile(input_covariates);
Rows read: 10 done.
cvrt
SlicedData object. For more information type: ?SlicedData
Number of columns: 10000
Number of rows: 10
Data is stored in 1 slices
Top left corner of the first slice (up to 10x10):
per0 per1 per2 per3 per4 per5 per6 per7
var0 2.716564 7.754125 4.642621 6.926951 7.343646 9.869086 15.1116143 2.571390
var1 15.027063 5.782866 -1.872673 18.242020 10.969449 18.722471 12.6817509 6.395979
var2 10.979109 10.026883 10.237184 10.097399 13.165302 15.210605 13.2285458 4.115239
var3 7.378765 12.535880 8.579287 8.725308 2.007193 14.330798 18.1710893 1.186859
var4 9.844247 1.461214 2.813233 13.475994 12.986890 -1.417745 0.9987122 6.364541
var5 1.632863 10.688924 12.137117 13.563648 2.134887 10.509293 10.1513451 15.690753
var6 9.224804 15.038707 11.706595 6.963588 8.593823 12.724256 9.2786732 13.368394
var7 12.827653 11.980328 8.255040 13.700810 1.531767 8.084301 3.1051125 10.633310
var8 7.210589 13.623182 8.817582 6.748911 7.739639 1.410529 20.1948945 11.921878
var9 8.197405 6.235491 11.358322 6.886067 15.598166 11.869307 19.6296573 15.661149
per8 per9
var0 6.173441 3.663875
var1 19.906280 7.129417
var2 6.399474 10.664086
var3 13.814093 7.189667
var4 21.994677 6.158198
var5 18.446367 11.648068
var6 8.782872 7.137007
var7 13.137208 12.944661
var8 10.045985 -3.368629
var9 6.823724 7.630835
# Check files:
str(cvrt)
Reference class 'SlicedData' [package "MatrixEQTL"] with 9 fields
$ dataEnv :<environment: 0x185738d50>
$ nSlices1 : int 1
$ rowNameSlices :List of 1
..$ : chr [1:10] "var0" "var1" "var2" "var3" ...
$ columnNames : chr [1:10000] "per0" "per1" "per2" "per3" ...
$ fileDelimiter : chr "\t"
$ fileSkipColumns : num 1
$ fileSkipRows : num 1
$ fileSliceSize : num 2000
$ fileOmitCharacters: chr "NA"
and 41 methods, of which 27 are possibly relevant:
Clear, Clone, ColumnSubsample, CombineInOneSlice, CreateFromMatrix, FindRow,
GetAllRowNames, GetNRowsInSlice, getSlice, getSliceRaw, initialize, IsCombined, LoadFile,
nCols, nRows, nSlices, ResliceCombined, RowMatrixMultiply, RowRemoveZeroEps, RowReorder,
RowReorderSimple, RowStandardizeCentered, SaveFile, SetNanRowMean, setSlice, setSliceRaw,
show#envRefClass
str(snps)
Reference class 'SlicedData' [package "MatrixEQTL"] with 9 fields
$ dataEnv :<environment: 0x1a6893c60>
$ nSlices1 : int 1
$ rowNameSlices :List of 1
..$ : chr [1:1000] "snp0" "snp1" "snp2" "snp3" ...
$ columnNames : chr [1:10000] "per0_per0" "per1_per1" "per2_per2" "per3_per3" ...
$ fileDelimiter : chr "\t"
$ fileSkipColumns : num 1
$ fileSkipRows : num 1
$ fileSliceSize : num 2000
$ fileOmitCharacters: chr "NA"
and 41 methods, of which 27 are possibly relevant:
Clear, Clone, ColumnSubsample, CombineInOneSlice, CreateFromMatrix, FindRow,
GetAllRowNames, GetNRowsInSlice, getSlice, getSliceRaw, initialize, IsCombined, LoadFile,
nCols, nRows, nSlices, ResliceCombined, RowMatrixMultiply, RowRemoveZeroEps, RowReorder,
RowReorderSimple, RowStandardizeCentered, SaveFile, SetNanRowMean, setSlice, setSliceRaw,
show#envRefClass
str(gene)
Reference class 'SlicedData' [package "MatrixEQTL"] with 9 fields
$ dataEnv :<environment: 0x18537aed8>
$ nSlices1 : int 1
$ rowNameSlices :List of 1
..$ : chr [1:2000] "var0" "var1" "var2" "var3" ...
$ columnNames : chr [1:10000] "per0" "per1" "per2" "per3" ...
$ fileDelimiter : chr "\t"
$ fileSkipColumns : num 1
$ fileSkipRows : num 1
$ fileSliceSize : num 2000
$ fileOmitCharacters: chr "NA"
and 41 methods, of which 27 are possibly relevant:
Clear, Clone, ColumnSubsample, CombineInOneSlice, CreateFromMatrix, FindRow,
GetAllRowNames, GetNRowsInSlice, getSlice, getSliceRaw, initialize, IsCombined, LoadFile,
nCols, nRows, nSlices, ResliceCombined, RowMatrixMultiply, RowRemoveZeroEps, RowReorder,
RowReorderSimple, RowStandardizeCentered, SaveFile, SetNanRowMean, setSlice, setSliceRaw,
show#envRefClass
dim(cvrt)
[1] 10 10000
dim(snps)
[1] 1000 10000
dim(gene)
[1] 2000 10000
Call MatrixEQTL’s main function:
results_mxqtl <- Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
# output_file_name.cis = output_file_name.cis,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
pvalue.hist = 'qqplot',
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE,
pvOutputThreshold = pvOutputThreshold,
# pvOutputThreshold.cis = pvOutputThreshold.cis,
# snpspos = snpsPos_data,
# genepos = probePos_data,
# cisDist = cisDist
)
Processing covariates
Task finished in 0.01 seconds
Processing gene expression data (imputation, residualization, etc.)
Task finished in 1.064 seconds
Creating output file(s)
Task finished in 0.017 seconds
Performing eQTL analysis
100.00% done, 23 eQTLs
Task finished in 15.93 seconds
We can inspect the results now. Each significant gene-SNP association is recorded in a separate line in the output file and in the returned object results_mxqtl. In case of cis/trans eQTL analysis described below, two output files are produced, one with cis-eQTLs, another only with trans. Every record contains a SNP name, a transcript name, estimate of the effect size, t- or F-statistic, p-value, and FDR.
show(results_mxqtl$all$eqtls) # The output will be NULL if we specified a cis distance
snps gene statistic pvalue FDR beta
1 snp286 var1242 5.221159 1.813826e-07 0.3627652 0.004452213
2 snp419 var1362 -4.966020 6.947178e-07 0.6947178 -0.004286616
3 snp165 var417 -4.684836 2.838838e-06 0.7965449 -0.004003128
4 snp577 var1766 4.640934 3.512541e-06 0.7965449 0.003983440
5 snp791 var1971 -4.631607 3.674224e-06 0.7965449 -0.003961242
6 snp723 var1594 4.591295 4.458863e-06 0.7965449 0.003897441
7 snp530 var207 -4.590588 4.473963e-06 0.7965449 -0.003905474
8 snp151 var1340 4.581169 4.679849e-06 0.7965449 0.003839068
9 snp639 var1017 4.577914 4.753095e-06 0.7965449 0.003952824
10 snp744 var490 -4.571621 4.897808e-06 0.7965449 -0.003844988
11 snp857 var543 -4.566104 5.028153e-06 0.7965449 -0.003939186
12 snp255 var588 4.565362 5.045942e-06 0.7965449 0.003880188
13 snp417 var508 4.549140 5.450082e-06 0.7965449 0.003906796
14 snp644 var924 4.540161 5.686908e-06 0.7965449 0.003919457
15 snp276 var1253 -4.500937 6.842033e-06 0.7965449 -0.003847994
16 snp564 var72 4.493805 7.074878e-06 0.7965449 0.003822842
17 snp494 var1421 -4.492368 7.122676e-06 0.7965449 -0.003819058
18 snp435 var1384 4.490988 7.168904e-06 0.7965449 0.003836114
19 snp573 var735 4.455928 8.444064e-06 0.8425041 0.003789458
20 snp210 var1164 -4.444020 8.924478e-06 0.8425041 -0.003782566
21 snp156 var1454 4.437396 9.202921e-06 0.8425041 0.003767391
22 snp876 var1891 4.433567 9.367666e-06 0.8425041 0.003753055
23 snp403 var1879 4.426284 9.688797e-06 0.8425041 0.003816646
head(results_mxqtl$all$eqtls)
plot(results_mxqtl)
head(results_mxqtl$cis$eqtls)
NULL
nrow(results_mxqtl$cis$eqtls)
NULL
head(results_mxqtl$cis$ntests)
NULL
head(results_mxqtl$trans$eqtls)
NULL
nrow(results_mxqtl$trans$eqtls)
NULL
head(results_mxqtl$trans$ntests)
NULL
What could we do next? Annotation, downstream analysis, plots for publication, follow-up of significant results.
Questions? Corrections? Comments?
R Markdown and Notebook:
Markdown language:
QTLs:
Programming with R best practice:
We use Software Carpentry materials for some of the examples and code. Please take a look at their webpage and lessons if you are interested.